Locally Differentially Private (Contextual) Bandits Learning
We study locally differentially private (LDP) bandits learning in this paper. First, we propose simple black-box reduction frameworks that can solve a large family of context-free bandits learning problems with an LDP guarantee. Based on our frameworks, we improve the previous best results for private bandits learning with one-point feedback, such as private Bandits Convex Optimization, and obtain the first results for Bandits Convex Optimization (BCO) with multi-point feedback under LDP. The LDP guarantee and black-box nature make our frameworks more attractive in real applications than previous specifically designed, and relatively weaker, differentially private (DP) algorithms. Further, we also extend our algorithm to Generalized Linear Bandits with regret bound $\tilde{\mathcal{O}}(T^{3/4}/\varepsilon)$ under $(\varepsilon, \delta)$-LDP, and it is conjectured to be optimal. Note that given the existing $\Omega(T)$ lower bound for DP contextual linear bandits (Shariff & Sheffet, NeurIPS 2018), our result shows a fundamental difference between LDP and DP for contextual bandits.
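The black-box reduction can be pictured as a local randomizer sitting between each user and the server: the user privatizes its own feedback before release, and the server runs an ordinary bandit algorithm on the noisy values. Below is a minimal sketch of such a randomizer, assuming a bounded scalar reward and the standard Gaussian mechanism calibrated for $(\varepsilon, \delta)$-LDP; the function name and sensitivity bound are illustrative assumptions, not details from the paper.

```python
import numpy as np

def gaussian_ldp_randomizer(value, eps, delta, sensitivity=2.0, rng=None):
    """Perturb a bounded scalar reward locally before sending it to the server.

    Assumes the reward lies in [-1, 1] (so L2-sensitivity <= 2) and uses the
    standard Gaussian-mechanism noise scale for (eps, delta)-LDP.
    """
    rng = rng or np.random.default_rng()
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma)

# Server side of the reduction (sketch): any bandit learner consumes the
# privatized rewards exactly as if they were the true (now noisier) feedback.
```

Because the noise is zero-mean, the privatized feedback is an unbiased but higher-variance version of the true reward, which is why regret degrades by a factor depending on $1/\varepsilon$ rather than breaking entirely.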
Bandit Learning with Implicit Feedback
Implicit feedback, such as user clicks, although abundant in online information service systems, does not provide substantial evidence of users' evaluation of the system's output. Without proper modeling, such incomplete supervision inevitably misleads model estimation, especially in a bandit learning setting where the feedback is acquired on the fly. In this work, we perform contextual bandit learning with implicit feedback by modeling the feedback as a composition of user result examination and relevance judgment. Since users' examination behavior is unobserved, we introduce latent variables to model it. We perform Thompson sampling on top of variational Bayesian inference for arm selection and model update. Our upper regret bound analysis of the proposed algorithm proves the feasibility of learning from implicit feedback in a bandit setting, and extensive empirical evaluations on click logs collected from a major MOOC platform further demonstrate its learning effectiveness in practice.
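The examination/relevance decomposition can be sketched concretely: a click fires only if the user both examines the result and finds it relevant, each modeled as a logistic function of the context, and Thompson sampling draws parameters from an approximate posterior before scoring arms. The Gaussian posteriors and all names below are illustrative assumptions, not the paper's actual variational updates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def click_probability(x, theta_exam, theta_rel):
    """Examination-hypothesis click model: a click requires the user to both
    examine the result and judge it relevant, each logistic in the context x."""
    return sigmoid(x @ theta_exam) * sigmoid(x @ theta_rel)

def thompson_select(arms, mu_exam, cov_exam, mu_rel, cov_rel, rng):
    """Thompson-sampling arm choice: draw both parameter vectors from
    (assumed Gaussian) approximate posteriors, then pick the arm that
    maximizes the sampled click probability."""
    te = rng.multivariate_normal(mu_exam, cov_exam)
    tr = rng.multivariate_normal(mu_rel, cov_rel)
    scores = [click_probability(x, te, tr) for x in arms]
    return int(np.argmax(scores))
```

The key point the model captures is that a non-click is ambiguous: it may mean low relevance, or simply that the examination factor was small.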
Bandit Learning in Concave N-Person Games
This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.
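The single-agent building block behind this kind of bandit learning is the one-point gradient estimator: play a perturbed action, observe a single payoff value, and rescale the perturbation direction into a gradient surrogate for a smoothed payoff. Below is a minimal Euclidean sketch, where projected ascent stands in for the paper's general mirror-descent machinery; the step size, perturbation radius, and function names are illustrative assumptions.

```python
import numpy as np

def one_point_gradient_ascent(payoff, x0, radius=1.0, T=2000, delta=0.05,
                              eta=0.01, seed=0):
    """Sketch of bandit mirror descent (Euclidean mirror map) with one-point
    feedback: query the payoff once at a perturbed point, form a gradient
    estimate, take a projected ascent step inside the ball of given radius."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    x = np.array(x0, dtype=float)
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)             # uniform direction on the sphere
        reward = payoff(x + delta * u)     # the only feedback observed
        g_hat = (d / delta) * reward * u   # one-point gradient estimate
        x = x + eta * g_hat                # ascent step
        n = np.linalg.norm(x)
        if n > radius - delta:             # keep the perturbed query feasible
            x *= (radius - delta) / n
    return x
```

In the game setting, each player runs such an update on its own payoff; the monotonicity condition in the abstract is what makes these independent trajectories jointly converge to Nash equilibrium rather than cycle.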
Improved Analysis for Bandit Learning in Matching Markets
A rich line of work studies the bandit learning problem in two-sided matching markets, where one side of the market participants (players) are uncertain about their preferences and hope to find a stable matching during iterative matchings with the other side (arms). The state-of-the-art analysis shows that the player-optimal stable regret is of order $O(K\log T/\Delta^2)$, where $K$ is the number of arms, $T$ is the horizon, and $\Delta$ is the players' minimum preference gap. However, this result may be far from the lower bound $\Omega(\max\{N\log T/\Delta^2, K\log T/\Delta\})$, since the number $K$ of arms (workers, publisher slots) may be much larger than the number $N$ of players (employers in labor markets, advertisers in online advertising, respectively). In this paper, we propose a new algorithm and show that the regret can be upper bounded by $O(N^2\log T/\Delta^2 + K\log T/\Delta)$. This result removes the dependence on $K$ in the main order term and improves the state-of-the-art guarantee in common cases where $N$ is much smaller than $K$. Such an advantage is also verified in experiments.
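The stability notion behind the regret definition is the classical Gale-Shapley one: with known preferences, player-proposing deferred acceptance yields the player-optimal stable matching, and the bandit problem is to approach it when players must learn their rankings from noisy feedback. A minimal known-preferences sketch (not the paper's algorithm; names are illustrative, and complete preference lists with $K \ge N$ are assumed):

```python
def deferred_acceptance(player_prefs, arm_prefs):
    """Player-proposing Gale-Shapley deferred acceptance.

    player_prefs[p] is p's ranked list of arms (best first);
    arm_prefs[a] is a's ranked list of players. Returns the
    player-optimal stable matching as {player: arm}.
    """
    # Precompute each arm's rank of each player for O(1) comparisons.
    rank = {a: {p: r for r, p in enumerate(prefs)}
            for a, prefs in arm_prefs.items()}
    next_choice = {p: 0 for p in player_prefs}  # next arm to propose to
    match = {}                                  # arm -> tentatively held player
    free = list(player_prefs)
    while free:
        p = free.pop()
        a = player_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if a not in match:
            match[a] = p                        # arm tentatively accepts
        elif rank[a][p] < rank[a][match[a]]:
            free.append(match[a])               # arm trades up, old player freed
            match[a] = p
        else:
            free.append(p)                      # proposal rejected
    return {p: a for a, p in match.items()}
```

In the bandit version, players do not know `player_prefs` and must estimate the rankings from stochastic rewards; the preference gap $\Delta$ controls how many samples are needed before the learned rankings stop producing unstable proposals.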
Reviews: Bandit Learning with Positive Externalities
The paper studies the interesting problem of learning with externalities in a multi-armed bandit (MAB) setting. The main idea is that there might be a bias in the preferences of the users arriving at on-line platforms. Specifically, future user arrivals are likely to have similar preferences to users who have previously accessed the same platform and were satisfied with the service. Since some on-line platforms use MAB algorithms for optimizing their service, the authors propose the Balanced Exploration (BE) MAB algorithm, which has a structured exploration strategy that takes this potential "future user preference bias" (referred to as "positive externalities") into account. The bias in the users' preferences translates directly into user-specific reward values: out of the m possible items/arms, each user has a preference for a subset of them (the reward being a Bernoulli reward with mean proportional to the popularity of the arm), and the rewards of all other arms are always null.
Reviews: Bandit Learning with Implicit Feedback
Summary: This work considers learning user preferences using a bandit model. The reward is based not only on the user's judgement of an arm, but also on whether the user examined it; that is, feedback = examination × judgement. In particular, if a user does not examine an arm, the lack of feedback does not necessarily indicate that the user does not "like" the arm. This work uses a latent model for the (unobserved) examination of arms, and posits that the probability of positive (binary) feedback can be expressed as the product of the probability of examination (logistic) and the probability of a positive relevance judgement (logistic). The work proposes a variational inference (VI) approach to estimating the parameters, and then uses Thompson Sampling from the approximate posterior as the policy. This allows the authors to use machinery from Russo and Van Roy to obtain regret bounds.
Reviews: Bandit Learning in Concave N-Person Games
Context: It is a classic result that the empirical frequencies of actions of players running regret-minimization algorithms converge to a coarse correlated equilibrium (CCE). CCEs are not necessarily desirable solution concepts because they sometimes admit irrational behavior. For monotone games, it is known that the empirical frequencies converge to a Nash equilibrium when agents play FTRL. Recently, Mertikopoulos et al. proved that the actual sequence of plays under FTRL converges to Nash for such games -- in fact, they prove something more general that goes beyond concave potential games. This work considers the case when each agent can only observe bandit feedback.